Parameter expansion in local-shrinkage models 



James G. Scott 

McCombs School of Business 
University of Texas at Austin 



September 2010 



Abstract 



This paper considers the problem of using MCMC to fit sparse Bayesian 
models based on normal scale-mixture priors. Examples of this framework 
include the Bayesian LASSO and the horseshoe prior. We study the usefulness 
of parameter expansion (PX) for improving convergence in such models, which 
is notoriously slow when the global variance component is near zero. Our 
conclusion is that parameter expansion does improve matters in LASSO-type 
models, but only modestly. In most cases this improvement, while noticeable, 
is less than what might be expected, especially compared to the improvements 
that PX makes possible for models very similar to those considered here. We 
give some examples, and we attempt to provide some intuition as to why this 
is so. We also describe how slice sampling may be used to update the global 
variance component. In practice, this approach seems to perform almost as 
well as parameter expansion. As a practical matter, however, it is perhaps best 
viewed not as a replacement for PX, but as a tool for expanding the class of 
models to which PX is applicable. 

Keywords: MCMC; normal scale mixtures; parameter expansion; sparsity 

1 Parameter expansion for variance components 

Many common Bayesian models have equivalent parameter-expanded (PX) versions, 
in which redundant, non-identified parameters are introduced for the sake of improv- 
ing MCMC convergence. In particular, we are interested in the following simple case 
and its generalizations: 
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Figure 1: Above: Simulation history and autocorrelation plot for r in the non- 
parameter-expanded Gibbs sampler. Below: the same plots for the parameter- 
expanded sampler. 
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Figure 2: The directed graph (left) and moralized undirected graph (right) corre- 
sponding to the PX model, |4])-([7]). Circles indicate nodes; rectangles, replicates. 



for i = 1, . . . , n and j = 1, . . . ,p. 

The parameter-expanded (PX) model corresponding to is 



A 



g 2 ~ 



N(Aa^,cr 2 ) 
N(0,/) 
N(0,1) 
IG(l/2,l/2) 



(4) 
(5) 
(6) 
(7) 



In both cases, assume that p(a) oc 1/cr. More generally, r may have a positive 
noncentral-t prior, and a similar equivalence will hold. But the half-Cauchy prior 



is an excellent default choice for many problems (German 2006; Poison and Scott 



2010), and is a useful special case for the sake of illustration. 

Despite the fact that A and g 2 are not individually identifiable, Model @~(|7]) 
and Model ([TJ)-([3]) are identical in all the important ways: the marginal likelihood 
in y, the model for (3j, and the implied prior for r. It is easy, moreover, to translate 
between the two parameterizations, since 0j = Aa9j and r = \A\g. Yet Q-Q will 
result in an MCMC that converges faster — often drastically so, and especially when 
t is close to zero. This phenomenon has been widely studied, and has been exploited 
with great success to speed computation for many common Bayesian models (see, 



e.g., van Dyk and Meng 
For example, Figure [1 
ular simulated data set wi 



2001, for many useful references). 



compares the standard and PX Gibbs samplers for a partic- 
/here p = 2000, n = 3, r = 0.25, and a = 1.25. The standard 
sampler exhibits severe autocorrelation, while the PX sampler appears healthy. (R 
code for implementing all simulations can be found in Appendix |A|) 
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Figure 3: Two equivalent ways of expressing the non-PX local shrinkage model. 

The advantage of Model @-(|7|) is apparent from Figure [2j and arises from the 
fact that A and g are conditionally independent in the posterior distribution, given 
= (9i, . . . , dp)'. The crucial fact is that g enters the model at the level of the 
parameters, while A enters the model at the level of the data. Since r is the product 
of these two factors, each of which may vary independently in the conditional posterior 
distribution, the result is reduced autocorrelation. 



2 LASSO-type Bayesian models 
2.1 Sparsity via scale mixtures 

Many popular models for a sparse location vector (3 = (j3i, . . . , /3 P )' assume an ex- 
changeable prior for (3j, and can can be studied most readily in the generalization 
of Q-(§ to cases where (3 is a sparse vector of normal means with a non-Gaussian 
prior. We now consider the question of whether parameter expansion can offer im- 
provements similar to those available in the pure Gaussian case. 

Suppose that (jjij \ (3j,cr 2 ) ~ N((3j,a 2 ) and the prior for (3j is of the form 
p(f3j/{ar}), where r has a prior distribution. Models of this form include the relevance 



vector machine of Tipping 


(2001); the double-exponential prior or "Bayesian LASSO" 


(Carlin and Poison 


1991 


Tibshirani 1996 Park and Casella 2008; I 


lans 2009 


Gra- 


macy and Pant alec 


2010|); the normal/ Jeffreys prior (Figueiredo 


2003 ] 


3ae and 


Mallick, 2004); the Strawderman-Berger prior (Strawderman, 1971 


Bergei 




L980); 



the normal/exponential/gamma prior (Griffin and Brown, 2005); and the horseshoe 
prior dCarvalho et~aL] [20To| ) , among many others. 

As many authors have observed, autocorrelation for r in these sparse models can 



cause great difficulty; see, for example, Hans (2010 ). The logic of parameter expansion 



implies that we should instead allow two nonidentified parameters to perform the work 
of the single identified parameter r. In light of the previous example, it is natural — 
but, as it turns out, naive — to hope that this approach can solve the problem. 

The difficulty is that all of these sparse models are best fit by introducing extra 
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Table 1: Priors for Xj associated with some common sparsity priors. Densities are 
given up to constants and do not account for global scale terms. 



Marginal prior for f3j Prior for Xj 



Double-exponential Xj exp y — Xj/2 

Cauchy Xj 2 exp jl/ (2X 2 

Strawderman-Berger Xj (1 + A 2 )~ 3 / 2 

Normal-exponential-gamma Xj (1 + A 2 )~( c+1 ) 

-i ** 

Normal-Jeffreys Aj 

Horseshoe (1 + A 2 ) -1 



identified parameters {A 2 , . . . , A 2 } to the model. This is done in such as a way as 
to make the prior for /3j conditionally Gaussian, given the local shrinkage factors Xj. 
This greatly simplifies the model. But as we shall see, it also diminishes the potential 
advantage to be gained by introducing non-identified variance components. 
To see this, write the local shrinkage generalization of Q-(|3]) as 

(^■|/3„a 2 ) ~ N(a/3„a 2 ) (8) 

(f3j\X 2 j,T 2 ) ~ N(0,r 2 A 2 ) (9) 

A, ~ p(Xj) (10) 

T ~ C + (0,1), (11) 

with Table [I] listing some choices for p(Xj) that give rise to common sparsity priors. 
Alternatively, (p]) and ^ may be re-written as 

( yi j\ej,Xj,a 2 ) ~ N(a\je jl( T 2 ) (12) 
(9,\r 2 ) ~ N(0,r 2 ) (13) 

with (3j = XjOj. These two equivalent ways of writing the model are shown graphically 



in Figure |3| Note the usual conjugate form preferred by Jeffreys ( 1961 ), with the error 
variance a 2 scaling the prior for the location vector. 

The obvious PX approach is to let r = |A|g as before, and to postprocess the 
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Figure 4: Four equivalent ways of expressing the parameter-expanded local shrinkage 
model. 



MCMC draws for g and A to estimate r. For example, we might let 



(yij | 9j,Xj,A,a 2 ) ~ N(<xAA^-, a 2 ) (14) 

(djlg 2 ) ~ N(0,^ 2 ) (15) 

A, ~ p(A,) (16) 

A ~ N(0,1) (17) 

g 2 ~ IG(l/2,l/2), (18) 



This version of the model, along with three equivalent versions, are shown graphically 
in Figure [4j These versions differ in where A and Xj enter the hierarchy (that is, at 
the data level or parameter level), and correspond to different undirected graphs for 
the full joint distribution. 



2.2 The non-PX updates for r and Aj 

As an alternative to parameter expansion, we use the following approach based on slice 
sampling (see, e.g., Damien et al. 1999). Define rjj = 1/A 2 , and define fa = f3j/(ar). 
Then the conditional posterior distribution of rjj, given all other model parameters, 
looks like 
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Therefore, the following two steps are sufficient to sample Xj: 

1. Sample (uj \ T]j) uniformly on the interval (0, 1/(1 + rjj)). 

2. Sample (r]j \ fij,Uj) ~ Ex(2//x 2 ) from an exponential density, truncated to have 
zero probability outside the interval (0, (1 — Uj)/uj). 

Transforming back to the A-scale will yield a draw from the desired conditional dis- 
tribution. 

The same trick works for r, letting r\ = 1/r 2 and replacing ft 2 by ^20]/2. Indeed, 
the approach will work for any prior for which the slice region in Step 1 is invertible, 
or can be transformed to make it invertible (as for the half-Cauchy). 

Slice-sampling can also be used independently for g, A, or both. This tactic 
expands the class of variance-component priors to which parameter expansion is ap- 
plicable. For example, the noncentral positive t distribution corresponds to r = \A\g 
where A ~ N(m, 1) and g 2 ~ IG(a/2,6/2). This leads to conditionally conjugate 
updates for both A and g 2 . 

Suppose, on the other hand, that one would prefer r 2 to have some prior other 
than a noncentral positive t. Slice sampling makes this possible. For example, let 
T 2 ~ IB(a, 6), an inverted beta or "beta-prime" distribution. This generalizes the 
half-Cauchy prior in a different direction, making r 2 equal in distribution to the ratio 
of two gamma random variables, or equivalently r = \A\g where A 2 ~ Ga(a, 1) and 
g 2 ~ IG(6, 1). It is then possible to use the usual conjugate update for g 2 in the PX 
model, and to use slice sampling to update A 2 . We do not explore this fact further, 
but note that it opens up the possibility of using PX to fit models involving even 
more general classes of variance-component priors. 

All other draws are standard Gibbs updates and are omitted. 



3 Simulation results 



To compare the PX and non-PX samplers, we used the horseshoe prior of Carvalho 
et al. (2010), where Aj ~ C + (0, 1). The resulting marginal prior distribution for /3j 



has Cauchy-like tails and a pole at zero, and seems to perform very well as a default 
sparsity prior. 

The line of reasoning behind the horseshoe prior is that r should concentrate near 
zero a posteriori. This will provide a strong shrinkage effect for most observations 
(i.e. the noise). The signals, meanwhile, will correspond to very large values of Aj, 
from far out in the tail of the half-Cauchy prior, allowing certain observations to 
escape the "gravitational pull" of r toward zero. 

Because it exhibits a striking antagonism between modeling goals and compu- 
tational goals, the horseshoe prior makes for an interesting test case. Convergence 



7 



problems arise precisely when r is small. Yet the logic of the model says that r must 
be small in order to squelch noise. 

Figures [5]-[7] summarize our results for three chosen cases. In all cases, we ran 
the PX and non-PX samplers on the same data, simulated from the true model. 
We burned-in the samplers for 2 x 10 4 iterations, and saved an additional 2 x 10 4 
iterations without any thinning. The samplers were initialized to the same values, 
and were provided a stream of pseudo-random numbers from R's default generator 
starting from the same seed. Code is provided in the Appendix that allows the reader 
to replicate these results, and to change the data set or RNG seed. 

Unfortunately, it appears that the parameter-expanded sampler offers, at best, 
only a modest improvement over the non-PX sampler. In some of the cases explored, 
the advantage was small but noticeable. In other cases, there seemed to be virtually no 
difference. In no situation that we investigated did we see an improvement anything 
like that shown for the global-shrinkage-only model (Figure [T]). 

This is disappointing, given the importance of these models in Bayesian statistics 
today. It is also strikingly different from the case where p(/3j | r 2 ) is a normal 
distribution and Aj = 1. We focus on results under the horseshoe prior, but the 
behavior we witnessed here appears to be quite general for other models, too (e.g. the 
Bayesian LASSO). 

To quantify the relative efficiency of the two samplers over a variety of different 
signal-to-noise ratios, we ran the following experiment for all combinations of r G 
{0.01, 0.05, 0.1, 0.5, 1} and n e {1, 2, 3, 4, 5}. In all cases, we set p = 1000 and a = 1. 

1. Simulate data from the true model for the given configuration of r and n. 

2. Run each sampler (PX and non-PX) for T = 10 5 iterations after an initial 
burn-in period of 2 x 10 4 iterations. 

3. Estimate the effective Monte Carlo sample size as T e — T/k for 

oo 

K = 1 + 2^corr {r (0) ,r (t) } . 

i=l 

This can be estimated from the MCMC output. 

4. Compute the relative efficiency of the PX (P) and non-PX (N) samplers as 

r e = Tf )/T e W . 

For each combination of r and n, we estimated the relative efficiency for 10 dif- 
ferent simulated data sets and averaged the results. These results are summarized in 
Table [2j while Table [3] shows the average effective sample size for the PX sampler. 
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Simulation History: No PX 



Autocorrelation Function for Tau: No PX 
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Figure 5: Case 1: p = 1000, n = 5, a = r = 1. Above: Simulation history and 
autocorrelation plot for r in the non-parameter-expanded Gibbs sampler under the 
horseshoe prior. Below: the same plots for the parameter-expanded sampler. 
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Simulation History: No PX Autocorrelation Function for Tau: No PX 
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Figure 6: Case 2: p = 2000, n = 3, a = 1, r — 0.1. Above: Simulation history and 
autocorrelation plot for r in the non-parameter-expanded Gibbs sampler under the 
horseshoe prior. Below: the same plots for the parameter-expanded sampler. 
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Simulation History: No PX Autocorrelation Function for Tau: No PX 
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Figure 7: Case 2: p = 5000, n = 2, a = 1, t = 0.01. Above: Simulation history and 
autocorrelation plot for r in the non-parameter-expanded Gibbs sampler under the 
horseshoe prior. Below: the same plots for the parameter-expanded sampler. 
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Table 2: Relative efficiency ratios of the PX sampler compared to the non-PX sampler 
for 20 different configurations of n and r. Numbers larger than 1 indicate that the 
PX sampler is more efficient. 









T 






n 


0.01 


0.05 


0.1 


0.5 


1 


2 


13.66 


1.59 


1.31 


1.31 


1.49 


3 


9.78 


1.46 


1.16 


2.44 


5.06 


5 


9.55 


1.51 


0.91 


5.04 


11.89 


10 


7.36 


1.02 


0.79 


1.31 


8.70 



Table 3: Effective sample size of 10 5 samples using the parameter-expanded MCMC 
for 20 different configurations of n and r. 

T 



n 


0.01 


0.05 


0.1 


0.5 


1 


2 


1143 


422 


554 


759 


476 


3 


940 


537 


632 


570 


313 


5 


936 


637 


545 


570 


510 


10 


767 


487 


463 


264 


151 
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From these tables, it is clear that, while the PX sampler usually outperforms the non- 
PX sampler, it is still highly inefficient. Here, an MCMC run of length 100,000 tends 
to yield the equivalent of roughly 100 to 1000 independent draws from the posterior 
distribution. These tables, moreover, reflect the performance of the algorithm on a 
data set with 1000 parameters and no problems introduced by collinear predictors, 
as might be the case in a regression problem. For larger problems with collinear 
predictors, the inefficiencies would likely be much greater. 



4 A final example 

Intuitively, the failure of PX to offer large gains is due to the presence of the Aj's in 
the conditionally Gaussian representation for As the directed graphs in Figure 
[4] show, the Aj's fail to be independent of both A and g in the conditional posterior 
distribution (given 6 and the data), no matter where they enter the hierarchy. This 
induces further autocorrelation in r, even if the model is structured so that A and g 
are still conditionally independent of each other. 

One final example provides some intuition that, indeed, it does seem to be the 
A/s that foul things up. Suppose that Xj ~ N + (l,t>) for some pre-specified value of 
v. When v is small, the A/s are restricted to lie very near their prior mean of 1. 
The resulting model behaves almost exactly like the pure global shrinkage model of 
([T])-(|3]). On the other hand, when v is large, the A^-'s are free to vary, and the model 
may shrink locally rather than just globally. 

As we vary v from a large value to a small value, we may study the behavior of 
the parameter-expanded MCMC. Figure [8] shows these results for a small experiment 
t — a = 1, n = 2, and p = 1000. Notice the increasing degree of autocorrelation as v 
gets bigger. This fact suggests that giving the A/s the freedom to move — as we must 
do in order to model sparse signals — will inevitably diminish the performance of the 
parameter-expanded sampler. 



5 Summary 

On balance, parameter expansion appears to be the safest way to handle the global 



scale parameter r in sparse Bayesian models like rt8])-(ll). We recommend it as the 
default option. But in light of these results, it is difficult to be very excited about this 
or any other existing technique for fitting such models. In highly sparse situations, 
where the posterior for r is highly concentrated near zero, MCMC-PX runs of 10 5 may 
yield effective sample sizes of only a few hundred draws. This is precisely the problem 
that parameter expansion usually solves in hierarchical models. Its performance on 
this count simply isn't very encouraging, even if it is usually a bit better than that 
of the second-best option. 



13 



Simulation History: v = (0.05)^2 
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Figure 8: Three examples of the PX sampler fit to a model where Xj ~ N + (l,t>). 
Top: v = 0.05 2 . Middle: v = 0.5 2 . Bottom: v = 5 2 . 
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What kind of algorithm could solve the problem? It is the A/s, of course, that 
are at the heart of the matter; they endow the class of conditionally Gaussian models 
with the ability to handle sparsity, but they make convergence much slower. 

Therefore, an algorithm that will solve the problem of autocorrelation in r must 
marginalize over the A/s in one of two updating steps: either pifij \ \j,Tj,a,yj), or 
p(r I /3, a, A, Y). The update for (3j, however, depends intimately upon conditional 
normality, making it a poor candidate for marginalization. 

Marginalizing over the A/s in the update for r is tricky. In most cases it will 
not even be practical to evaluate the marginal likelihood p(Y | r, a) without invoking 
the Aj's. But there are some families of priors for which it is possible. In particular, 



Poison and Scott (2010) study the class of hypergeometric inverted-beta priors for 



a global variance component. This four parameter generalization of the inverted- 
beta family yields a form for p(Y | r, a) that can be expressed in terms of doubly 
nested hypergeometric series. These series, however, are sometimes painfully slow to 
converge, and we have not yet managed to exploit them to produce a "A-marginalized" 
Gibbs sampler that is competitive with standard methods. This remains an active 
area of research. As the results of this paper show, a better approach is critically 
needed. 



A R code 

A.l The non-PX sampler 

set.seed(42) 
p = 2000 
n = 3 

TauTrue =0.1 
SigmaTrue = 1 

LambdaTrue = abs (rt (p, 1) ) 

#LambdaTrue = rep(l,p) # Use if you are testing a global-shrinkage model 
BetaTrue = rnorm(p,0,SigmaTrue*LambdaTrue*TauTrue) 

Y = matrix (0, nrow=p, ncol=n) 

ford in l:n) 

{ 

Y[,i] = BetaTrue + rnorm(p,0, SigmaTrue) 
> 

Ybar = apply (Y, 1 , mean) 

Beta = rep(0,p) 
Sigma2 = 1 
Sigma = 1 
Tau = TauTrue 
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Lambda = rep(l,p) 
Sigma = sqrt(Sigma2) 

nmc = 20000 
burn = 20000 

BetaSave = matrix (0, nrow=nmc , ncol=p) 
LambdaSave = matrix (0, nrow=nmc, ncol=p) 
TauSave = rep(0, nmc) 
Sigma2Save = rep(0, nmc) 
Res2 = Y 

for(t in 1 : (nmc+burn) ) 
{ 

if(t %% 1000 == 0) cat ("Iteration ",t, "\n") 

# First block-update Beta 
a = (Tau~2)*(Lambda~2) 

b = n*a 

s = sqrt(Sigma2*a/{l+b}) 
m = {b/{l+b}}*Ybar 
Beta = rnorm(p, m, s) 
Theta = Beta/(Sigma*Lambda) 

# Now update Sigma2 

# Jeffreys prior is assumed 
ford in l:n) 

{ 

Res2[,i] = {Y[,i]-2}/{l+(Tau-2)*{Lambda-2}} 
} 

RSS = sum(Res2) 

Sigma2 = l/rgamma(l ,n*p/2, rate = RSS/2) 
Sigma = sqrt(Sigma2) 

# Now update Tau~2 using slice sampling 
eta = l/(Tau~2) 

u = runif (1,0,1/ (eta + 1)) 

ub = (l-u)/u 

a = (p+l)/2 

b = sum(Theta~2)/2 

ub2 = pgamma(ub,a,rate=b) 

u2 = runif (1,0, ub2) 

eta = qgamma(u2,a,rate=b) 

Tau = l/sqrt(eta) 

# Now update Lambda, comment out for global shrinkage only 
Z = Ybar/(Sigma*Theta) 

V2 = l/rgamma(p,l,rate=(Lambda~2+l)/2) 
numl = n*V2*(Theta~2) 
den = 1 + numl 
s = sqrt(V2/den) 
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m = {numl/den}*Z 
Lambda = rnorm(p,m,s) 

if(t > burn) 
{ 

BetaSave [t-burn,] = Beta 
LambdaSave [t-burn,] = Lambda 
TauSave [t-burn] = Tau 
Sigma2Save [t-burn] = Sigma2 
} 
} 

BetaHat = apply (BetaSave , 2 ,mean) 
LambdaHat = apply (abs (LambdaSave) , 2 ,mean) 

A.2 The PX sampler 

set.seed(42) 
p = 2000 
n = 3 

TauTrue =0.1 
SigmaTrue = 1 

LambdaTrue = abs(rt(p,l)) 

#LambdaTrue = rep(l,p) # Use if you are testing a global-shrinkage model 
BetaTrue = rnorm(p,0,SigmaTrue*LambdaTrue*TauTrue) 

Y = matrix (0, nrow=p, ncol=n) 

for(i in l:n) 

{ 

Y[,i] = BetaTrue + rnorm(p,0, SigmaTrue) 
} 

Ybar = apply (Y, 1 , mean) 

Beta = rep(0,p) 
Sigma2 = 1 
Sigma = 1 
G = 1 

Delta = TauTrue 
Tau = abs (Delta) *G 
Lambda = rep(l,p) 
Sigma = sqrt(Sigma2) 

nmc = 20000 
burn = 20000 

BetaSave = matrix (0, nrow=nmc , ncol=p) 
LambdaSave = matrix (0, nrow=nmc, ncol=p) 
TauSave = rep(0, nmc) 
Sigma2Save = rep(0, nmc) 
Res2 = Y 
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for(t in 1 : (nmc+burn) ) 
{ 

if(t 7.7. 1000 == 0) cat ("Iteration ",t, "\n") 

# First block-update Beta 
a = (Tau~2)*(Lambda~2) 

b = n*a 

s = sqrt(Sigma2*a/{l+b}) 

m = {b/{l+b}}*Ybar 

Beta = rnormCp, m, s) 

Theta = Beta/(Sigma*Delta*Lambda) 

# Now update Sigma2 

# Jeffreys prior is assumed 
ford in l:n) 

{ 

Res2[,i] = {Y[,i]~2}/{l+(Tau~2)*{Lambda-2}} 
} 

RSS = sum(Res2) 

Sigma2 = l/rgamma(l ,n*p/2, rate = RSS/2) 
Sigma = sqrt(Sigma2) 

# Now update Tau~2 

# Method 2: parameter expansion 
{ 

G = l/sqrt(rgamma(l, (p+l)/2, rate = (l+sum(Theta~2) ) /2) ) 

Z = Ybar/(Sigma*Theta*Lambda) 

a = n*(Lambda*Theta)~2 

b = sum (a) 

s2 = l/(l+b) 

m = {s2}*sum(a*Z) 

Delta = rnorm(l,m,sqrt(s2)) 

Tau = abs (Delta) *G 

} 

# Now update Lambda, comment out for global shrinkage only 
Z = Ybar/(Sigma*Delta*Theta) 

V2 = l/rgamma(p,l,rate=(Lambda~2+l)/2) 

numl = n*V2*((Delta*Theta)~2) 

den = 1 + numl 

s = sqrt(V2/den) 

m = {numl/den}*Z 

Lambda = rnorm(p,m,s) 

if(t > burn) 
{ 

BetaSave [t-burn,] = Beta 
LambdaSave [t-burn,] = Lambda 
TauSave [t-burn] = Tau 
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Sigma2Save [t-burn] = Sigma2 

} 

} 

BetaHat = apply (BetaSave , 2 ,mean) 
LambdaHat = apply (abs (LambdaSave) , 2 ,mean) 
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